Quantifying collective identity online from self-defining hashtags

Mass communication over social media can drive rapid changes in our sense of collective identity. Hashtags in particular have acted as powerful social coordinators, playing a key role in organizing social movements like the Gezi park protests, Occupy Wall Street, #metoo, and #blacklivesmatter. Here we quantify collective identity from the use of hashtags as self-labels in over 85,000 actively-maintained Twitter user profiles spanning 2017–2019. Collective identities emerge from a graph model of individuals’ overlapping self-labels, producing a hierarchy of graph clusters. Each cluster is bound together and characterized semantically by specific hashtags key to its formation. We define and apply two information-theoretic measures to quantify the strength of identities in the hierarchy. First we measure collective identity coherence to determine how integrated any identity is from local to global scales. Second, we consider the conspicuousness of any identity given its vocabulary versus the global identity map. Our work reveals a rich landscape of online identity emerging from the hierarchical alignment of uncoordinated self-labeling actions.


Graph metrics
Our graphs consist of user nodes, with an edge between two users indicating that both use at least one common hashtag in their self-description profiles. We begin with 91,093 user nodes, of which the largest connected component contains 89,647; the component size distribution is shown in Table 1. We then perform all our analyses on the 2-core of the giant component, which removes users that share hashtags with only one other user. Table 2 contains graph metrics for that 2-core. Its path length histogram is shown in Figure 1; the degree distribution and associated heavy-tail fit information are shown in Figure 2 and Table 3.
Following [2,3,1], the heavy-tail behavior in the degree distribution is not decidable among stretched exponential, truncated power law, and lognormal models (see Table 3 for log-likelihood comparison results).
Table 3: Log-likelihood fit comparisons for an array of heavy-tail models on the degree distribution shown in Figure 2. Each cell shows (R, p): R is the log-likelihood ratio of the ith row model fit against that of the jth column model. Positive R indicates a preference for the ith row model, and negative R a preference for the jth column model. p is the p-value associated with each R, as calculated in [1]. Truncated power law, stretched exponential, and lognormal are preferred over exponential and power law, although no preference is shown among the former three.
Figure 3 shows the graph of 1st-level (local) clusters with all edges shown (instead of the backbone as in Fig. 2, main text). Figure 4 shows the size distribution of these first-level clusters and of the two higher levels in the Louvain hierarchy. Figure 5 shows the distribution of measured collective coherence, as well as the threshold used to label high-coherence clusters in Fig. 2a, main text. Figure 6 contains an explanatory diagram of hierarchical identity structure, and Figure 7 contains details of the null conspicuousness sample sizes used to create the null conspicuousness CI band.
Figure 7 (top): CI ranges are plotted only for those bins with more than 100 conspicuousness values. Bottom: generated conspicuousness sample size per null community size bin. The horizontal dashed line represents the threshold of 100 samples needed to plot the CI in the top plot for any bin. Using this threshold, more than 30,000 simulations were needed to produce the CI band covering the range of empirical community sizes (up to roughly 10,000 users).
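The graph construction and 2-core reduction described at the start of this section can be sketched in a few lines. This is a minimal illustration rather than the pipeline used for the paper: the function names (build_user_graph, giant_component, two_core) and the toy profiles are hypothetical, and edge weights are ignored.

```python
from collections import defaultdict, deque

def build_user_graph(profiles):
    """Link two users whenever their profiles share at least one hashtag.

    profiles: dict mapping user -> set of hashtags (toy input format, assumed).
    Returns an adjacency dict mapping user -> set of neighbouring users.
    """
    users_by_tag = defaultdict(set)
    for user, tags in profiles.items():
        for tag in tags:
            users_by_tag[tag].add(user)
    adj = {user: set() for user in profiles}
    for members in users_by_tag.values():
        for user in members:
            adj[user] |= members - {user}
    return adj

def giant_component(adj):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def two_core(adj, nodes):
    """Repeatedly prune nodes with fewer than two remaining neighbours."""
    alive = set(nodes)
    degree = {u: len(adj[u] & alive) for u in alive}
    queue = deque(u for u in alive if degree[u] < 2)
    while queue:
        u = queue.popleft()
        if u not in alive:
            continue
        alive.remove(u)
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                if degree[v] < 2:
                    queue.append(v)
    return alive

# Toy data: a triangle (a, b, c), a pendant chain (d, e), a separate pair, a loner.
profiles = {
    "a": {"#x"}, "b": {"#x"}, "c": {"#x", "#y"},
    "d": {"#y", "#z"}, "e": {"#z"},
    "f": {"#w"}, "g": {"#w"}, "h": set(),
}
adj = build_user_graph(profiles)
giant = giant_component(adj)
core = two_core(adj, giant)
```

Note that the pruning is iterative: removing a user with a single neighbour can leave that neighbour with a single neighbour in turn, as happens for the chain d, e in the toy data.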

Coherence using an SBM hierarchical partition
In the main text, we produce a clustering hierarchy using the Louvain algorithm, and compare clusters from the bottom level against the top level to produce coherence scores. Our analysis depends on the Louvain algorithm to find clusters representing assortative communities of self-labelers, so to strengthen our findings we compare against coherence measurements from a hierarchical stochastic block model (SBM) [5,6,7]. All SBM calculations are performed with the graph-tool library [4], using a geometrically-distributed edge covariate model to represent positive integer edge weights (see [6]). We present results from the SBM fit with the smallest description length (entropy) out of five attempts (59166727 vs. 59259313, 59320690, 59400638, and 59677231, truncated to natural numbers) [7].
The hierarchical SBM provides a rigorous inferential approach to complement the Louvain hierarchy. Although the SBM does not specifically infer partitions based on network assortativity, its formulation in [5,6,7], as implemented in the graph-tool library, makes minimal model assumptions and can infer clusterings that might stem from a number of generative processes besides assortativity. We present statistics for both the Louvain and SBM clusterings by level in Tables 4 and 5; Louvain produces three levels, while the SBM finds 12.
Much of our Louvain-based analysis makes use of modularity: both the definition of the clusters themselves and the ranking of prototypical self-labels per cluster rely on modularity as a quantity. To compare the two clustering approaches, we calculate modularity for both at each level. As expected, the Louvain algorithm increases the modularity of its partition at every level. The SBM, in contrast, seeks to minimize the description length of its inferred nested partition, and therefore need not maximize modularity at any level [5]. It is noteworthy, then, that SBM-produced partitions increase in modularity with every level up through level 6. We compare the Louvain top-level clustering to each SBM level via their normed maximum partition overlap similarity score, provided by the graph-tool partition_overlap function [8,4], presented in Table 5. We choose SBM level 6 as the upper level for calculating coherence, as it is the level that achieves both maximum modularity and maximum overlap with the Louvain top-level clustering.
We find that the hierarchical SBM produces a first-level cluster landscape that "breaks up" the larger, high-coherence clusters found in the first level produced by the Louvain algorithm (see Figure 9). However, the SBM coalesces the clusters containing the user vertices common to these high-coherence Louvain clusters, recovering them increasingly with rising level.
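As a point of reference for the per-level modularity comparison, the following sketch computes modularity for an arbitrary partition of a weighted, undirected graph. It assumes the standard Newman-Girvan definition with resolution 1, Q = sum over clusters c of [w_c / m - (d_c / 2m)^2], where w_c is the edge weight inside c, d_c the total weighted degree of c, and m the total edge weight. The function name and toy graph are hypothetical, and the paper's own calculation may differ in detail.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Weighted modularity of a partition (standard definition, resolution 1).

    edges: iterable of (u, v, weight), each undirected edge listed once.
    membership: dict mapping node -> cluster label.
    """
    total = 0.0                    # m: total edge weight
    internal = defaultdict(float)  # w_c: weight of edges inside cluster c
    strength = defaultdict(float)  # d_c: summed weighted degree of cluster c
    for u, v, w in edges:
        total += w
        strength[membership[u]] += w
        strength[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    return sum(
        internal[c] / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1),
         (3, 4, 1), (4, 5, 1), (3, 5, 1),
         (2, 3, 1)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single_block = {n: "A" for n in range(6)}

q_split = modularity(edges, by_triangle)   # 2 * (3/7 - (7/14)**2) = 5/14
q_single = modularity(edges, single_block) # 0 for the trivial partition
```

Because the function takes any node-to-cluster mapping, the same call can score a Louvain level or an SBM level, which is what makes the two hierarchies comparable on this quantity.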
In the main text, we highlight these highest-coherence bottom-level Louvain clusters. We provide evidence in Figure 10 that these particular high-coherence clusters are increasingly recovered as the SBM level increases. We match each Louvain cluster to the single SBM cluster in each level that shares the greatest overlap in user nodes, then plot its Louvain-based coherence score against the overlap with its SBM counterpart. As the SBM level increases, we see that the high-coherence clusters are exactly those that are recovered, as shown by their increased overlap with matched, single SBM clusters. The log-x plots show that a population of the smallest, least coherent clusters is "found" by the SBM at the first level; it takes further levels to recover the larger, more coherent Louvain bottom-level clusters to the same degree.
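The per-cluster matching used for Figure 10 can be illustrated as follows. The sketch picks, for one Louvain cluster, the SBM cluster sharing the most user nodes, and reports the two scores defined in the captions of Tables 7 and 8: the overlap score (intersection normalized by Louvain cluster cardinality) and the Jaccard index (intersection normalized by the union). The function name and the toy clusters are hypothetical.

```python
def best_match(louvain_cluster, sbm_clusters):
    """Find the SBM cluster sharing the most users with one Louvain cluster.

    louvain_cluster: set of user ids.
    sbm_clusters: dict mapping SBM cluster label -> set of user ids.
    Returns (label, overlap score, Jaccard index).
    """
    best_label, best_shared = None, -1
    for label, members in sbm_clusters.items():
        shared = len(louvain_cluster & members)
        if shared > best_shared:
            best_label, best_shared = label, shared
    matched = sbm_clusters[best_label]
    overlap = best_shared / len(louvain_cluster)            # |A ∩ B| / |A|
    jaccard = best_shared / len(louvain_cluster | matched)  # |A ∩ B| / |A ∪ B|
    return best_label, overlap, jaccard

# Toy example: one Louvain cluster of five users against three SBM clusters.
louvain_cluster = {1, 2, 3, 4, 5}
sbm_level = {"s0": {1, 2, 3, 9}, "s1": {4, 5}, "s2": {6, 7}}
label, overlap, jaccard = best_match(louvain_cluster, sbm_level)
```

In the toy example the Louvain cluster is split between s0 and s1, so its best single match recovers only 60% of its users; this is the "breaking up" effect that diminishes at higher SBM levels.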
Overall, the hierarchical SBM produces more hierarchical levels than Louvain, with the lower levels consisting of smaller and more uniformly sized clusters (Table 4 vs. Table 5; Figure 9). This is not unexpected: in comparisons between generative model-agnostic SBM clusterings and "planted partition" SBMs with assortativity-seeking generative models, the general SBM tended to break up the larger clusters found by the planted partition SBM [10]. We could not use the planted partition model hierarchically, as accomplished for the general SBM in [5] and [6], since the planted partition model had not been implemented in a hierarchical way in the graph-tool software as of May 2022.
We calculate coherence distributions for lower SBM levels with respect to SBM level 6, established earlier. Only by level 3 does the coherence distribution reach the original Louvain high-coherence range (Figure 11). Therefore, we use SBM levels 3 and 4 vs. 6 as our lower- vs. upper-level analogs for the Louvain bottom- and top-level clusterings presented in the main text. Figure 12 shows a direct positive relationship between SBM- and Louvain-derived coherence scores, where clusters are matched between models by maximizing overall cluster overlap, implemented with scipy's linear_sum_assignment function.
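The one-to-one matching behind Figure 12 can be sketched with scipy's linear_sum_assignment, which solves the assignment problem on a score matrix. The sketch below assumes that the score being maximized is the raw number of shared users per cluster pair; the text does not specify whether raw or normalized overlap was used, so treat that choice, the function name, and the toy clusters as assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(louvain, sbm):
    """One-to-one matching of clusters that maximizes total shared users.

    louvain, sbm: dicts mapping cluster label -> set of user ids.
    Returns a dict mapping Louvain label -> matched SBM label.
    """
    l_labels = sorted(louvain)
    s_labels = sorted(sbm)
    # shared[i, j] = number of users common to Louvain cluster i and SBM cluster j
    shared = np.array(
        [[len(louvain[a] & sbm[b]) for b in s_labels] for a in l_labels]
    )
    rows, cols = linear_sum_assignment(shared, maximize=True)
    return {l_labels[r]: s_labels[c] for r, c in zip(rows, cols)}

# Toy example: two Louvain clusters against three SBM clusters.
louvain = {"a": {1, 2, 3, 4}, "b": {5, 6, 7}}
sbm = {"x": {1, 2, 3, 5}, "y": {4}, "z": {6, 7}}
matching = match_clusters(louvain, sbm)
```

Unlike matching each cluster independently to its best counterpart, the assignment formulation guarantees that no SBM cluster is matched to two Louvain clusters, which is what a cluster-by-cluster comparison of coherence scores requires.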
Tables 6, 7 and 8 show the top-10 prototypical self-labels for high-coherence lower-level clusters in the Louvain and SBM cases. In terms of the face validity of the labels describing each cluster, the SBM recovers many of the original Louvain-dependent results, with some additional clusters above the 0.05 bit threshold. However, the point is not to recover clusterings exactly, but to show a consistent variation in the structure of the identity cluster hierarchies: one where particularly coherent communities provide more information about their encompassing higher-level identity landscapes than others. Both Louvain- and SBM-derived measurements establish this, with significant recovery of specific results.
Figure 10: As in Figure 9, but plotted against the Louvain clusters' coherence scores. The high-coherence Louvain bottom-level clusters defined in Figure 5 and shown in Figure 2, main text, tend to be recovered increasingly by the SBM with rising level. Compare Table 6 to Tables 7 and 8 to see recovery of the prototypical labels of high-coherence Louvain clusters from the SBM.
Figure 12: We find a positive relationship between SBM- and Louvain-derived coherence measurements (compare Figures 5 and 11).
#mufc            #music      #gamer
#glazersout      #choir      #twitch
#teamjesus       #acoustic   #ps4
#luhg            #doggos     #twitchaffiliate
#utfr            #rain       #xbox
#woodwardout     #giggles    #streamer
#mufc            #series     #overwatch
#kwankwasiyya    #newsong    #fortnite
#oleout          #flute      #twitchkittens
#martialfc       #slander    #ffxiv
Table 6: Top-10 prototypical labels for bottom-level Louvain clusters with respect to the top (3rd) level, found above the 0.05 bit threshold defined in Figure 5, ordered by decreasing coherence. Each of these is presented with only its top-4 labels in Figure 2a, main text. Compare to the SBM-based labels in Tables 7 and 8.
Table 7: Top-10 prototypical labels for level 3 SBM clusters with respect to SBM level 6, found above the 0.05 bit threshold defined in Figure 5, ordered by decreasing coherence. Compare to the Louvain-based labels in Table 6. Markers such as (f, 0.60, 0.57) indicate the matched Louvain cluster in Table 6, its associated cluster overlap score (set intersection normalized by Louvain cluster cardinality), and the matching Jaccard index (set intersection normalized by the union of both sets). See §3 for why SBM levels 3 and 6 were selected to calculate coherence. Clusters are matched between models by maximizing overall cluster overlap, implemented with the scipy library's linear_sum_assignment function.
Table 8: Top-10 prototypical labels for level 4 SBM clusters with respect to SBM level 6, found above the 0.05 bit threshold defined in Figure 5, as in Table 7. Compare to Table 6, which contains the Louvain-based labels. Markers such as (f, 0.68, 0.39) indicate the matched Louvain cluster in Table 6, its associated cluster overlap score (set intersection normalized by Louvain cluster cardinality), and the matching Jaccard index (set intersection normalized by the union of both sets). See §3 for why SBM levels 4 and 6 were selected to calculate coherence. Clusters are matched between models by maximizing overall cluster overlap, implemented with the scipy library's linear_sum_assignment function.
Figure 13: Compare to Figure 7 in this document and Figure 3 in the main text. The 95% null CI bands for all four levels were generated from over 10,000 simulations using the same method as in the Louvain case, replacing the Louvain algorithm with the hierarchical SBM.

Conspicuousness at different SBM levels
We observe similar conspicuousness results for the SBM as in the Louvain case (Figure 13). At the first SBM level, cluster size is curtailed overall in both empirical and null results (as discussed earlier in this document), and empirical conspicuousness is largely contained in the null region. As the level increases beyond that base, we see results much more akin to our Louvain results (Figure 7 in this document and Figure 3 in the main text). Particularly by levels 3 and 4, we see conspicuousness diverging from the null at larger cluster sizes, as also observed in the Louvain-based analysis.
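The null CI bands used in these comparisons can be sketched as a per-bin percentile interval that is reported only for bins with more than 100 null samples, the threshold stated in the Figure 7 caption. Everything else here is an assumption made for illustration: the power-of-two size bins, the linear-interpolation percentile, and the function names. The null simulations that generate the samples are not shown.

```python
from collections import defaultdict
import math

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a pre-sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def null_ci_band(samples, min_count=100, level=0.95):
    """CI of null conspicuousness per community-size bin.

    samples: iterable of (community_size, conspicuousness) pairs taken from
             null simulations; sizes are positive integers.
    Bins are powers of two (assumed): bin k holds sizes in [2**k, 2**(k+1)).
    Only bins with more than min_count samples get an interval.
    """
    by_bin = defaultdict(list)
    for size, value in samples:
        by_bin[size.bit_length() - 1].append(value)
    alpha = (1 - level) / 2
    band = {}
    for k, values in by_bin.items():
        if len(values) > min_count:
            values.sort()
            band[k] = (percentile(values, alpha), percentile(values, 1 - alpha))
    return band

# Toy null samples: a well-populated bin (size 8) and a sparse one (size 1000).
samples = [(8, i / 200) for i in range(201)] + [(1000, 0.5)] * 50
band = null_ci_band(samples)
low, high = band[3]
```

Large communities are rare in any single null simulation, so the sparse bins are the large-size ones; this is why many simulations are needed before the band extends across the full range of empirical community sizes.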

Hashtag glossary for main text
#coys "come on you spurs": the Tottenham Hotspur Football Club
#ttp "trust the process" (in the sports context): refers to the long-term coaching strategy of the Philadelphia 76ers